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Abstract 


We  have  discovered  a  new  implementation  of  the  qd  algorithm  that  has  a  far  wider 
domain  of  stability  than  Rutishauser’s  version.  Our  algorithm  was  developed  from 
an  examination  of  the  LR-Cholesky  transformation  and  can  be  adapted  to  parallel 
computation  in  stark  contrast  to  traditional  qd.  Our  algorithm  also  yields  useful 
a  posteriori  upper  and  lower  bounds  on  the  smallest  singular  value  of  a  bidiagonal 
matrix. 

The  zero-shift  bidiagonal  QR  of  Demmel  and  Kahan  computes  the  smallest  singu¬ 
lar  values  to  maximal  relative  accuracy  and  the  others  to  maximal  absolute  accuracy 
with  little  or  no  degradation  in  efficiency  when  compared  with  the  LINPACK  code. 
Our  algorithm  obtains  maximal  relative  accuracy  for  all  the  singular  values  and  runs 
at  least  four  times  faster  than  the  LINPACK  code. 

Key  words:  qd,  LR  algorithm,  Cholesky  decomposition,  singular  values,  SVD, 
bidiagonal  matrices 
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1  Introduction  and  Summary 


In  September  1991  J.  W.  Demmel  and  W.  M.  Kahan  were  awarded  the  second  SIAM 
prize  in  numerical  linear  algebra  for  their  paper  ‘Accurate  Singular  Values  of  Bidiagonal 
Matrices’  [4],  referred  to  as  DK  hereafter.  Among  several  valuable  results  was  the  obser¬ 
vation  that  the  standard  bidiagonal  QR  algorithm  used  in  UNPACK  [5],  and  in  many 
other  SVD  programs,  can  be  simplified  when  the  shift  is  zero  and,  of  greater  importance, 
no  subtractions  occur.  The  last  feature  permits  very  small  singular  values  to  be  found 
with  (almost)  aU  the  accuracy  permitted  by  the  data  and  at  no  extra  cost. 

In  this  paper  we  show  that  the  DK  zero  shift  algorithm  can  be  further  simplified  and  this 
simplicity  has  several  benefits.  One  is  that  a  new  algorithm  can  be  implemented  in  either 
parallel  or  pipelined  format  as  an  C?(log2  n)  algorithm.  This  is  pursued  in  a  companion 
paper  [9]. 

Our  investigations  began  with  the  modest  goal  of  showing  that  it  was  preferable  to 
replace  the  DK  zero-shift  QR  transform  by  two  steps  of  zero-shift  LR  implemented  in 
a  qd  (quotient-difference)  format.  Root-free  algorithms  run  considerably  faster  than 
standard  ones.  The  surprise  here  is  that  to  keep  the  high  relative  accuracy  property  it  is 
necessary  to  use  a  little  known  variant  of  qd  (the  differential  form  of  the  progressive  qd 
algorithm  or  dqd  [25],  [24]).  The  standard  qd  will  not  suffice  as  we  show  in  Section  4. 
There  are  no  subtractions  in  dqd.  We  suspect  that  Rutishauser  discovered  dqd  in  1968, 
just  two  years  before  his  death,  and  we  say  more  about  its  history  in  Sections  4  and  11. 

What  we  want  to  stress  here  is  that,  for  reasons  we  may  never  know,  Rutishauser  did  not 
consider  the  shifted  version  of  dqd.  Incidentally  this  differential  qd  is  not  to  be  confused 
with  the  continuous  analogue  of  qd  (see  [21])  and  more  recent  work  on  QR  flows.  The 
trouble  with  the  shifted  version  of  the  ordinary  qd  algorithm  is  that  it  cannot  recover 
from  a  shift  that  is  too  large.  Consequently  qd  algorithms  have  been  shackled  with  very 
conservative  shift  strategies,  such  as  Newton’s  method,  and  earned  the  reputation  of  being 
slow  compared  to  the  QR  algorithm.  Had  Rutishauser  considered  shifts  with  differential 
qd  (dqds  hereafter)  he  would  have  realized,  as  we  soon  did,  that  the  transformation  may 
be  split  into  two  parts.  The  parts  depend  on  whether  the  machine  is  of  sequential  or 
parallel  type  but,  in  each  case,  a  shift  that  is  too  big  reveals  itself  before  the  old  matrix 
is  overwritten  and  so  need  not  be  invoked.  An  unused  shift  is  not  wasted  because  it 
gives  an  improved  upper  bound  on  the  smallest  singular  value  at  a  cost  less  than  one  qd 
transformation  as  well  as  contributing  to  an  improved  shift. 

Our  approach  frees  the  algorithm  to  exploit  powerful  shift  strategies  while  preserving 
high  relative  accuracy  all  the  time.  In  contrast  the  QR  algorithm  delivers  high  relative 
accuracy  only  with  a  zero  shift. 

Even  though  our  algorithms  must  find  the  singular  values  in  order  we  can  use  shift 
strategies  that  are  at  least  quadratically  convergent.  This  is  better  than  fourth  order 
convergence  for  QR.  When  only  the  smallest  few  singular  values  are  needed  this  ordering 
constraint  is  a  great  advantage.  Another  rather  subtle  feature  is  that  it  is  not  necessary 
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to  make  an  extra  0{n)  check  for  splitting  of  the  matrix  into  a  direct  sum.  The  necessary 
information  is  provided  by  the  auxiliary  quantities. 

In  June  1992  we  discovered  that  our  dqds  algorithm  enjoys  high  relative  stability  for  all 
shifts  provided  that  they  avoid  underflow,  overflow  or  divide  by  zero.  Consequently  it  can 
be  used  in  a  variety  of  applications  (eigenvalues  of  symmetric  or  unsymmetric  tridiago¬ 
nals,  zeros  of  polynomials,  poles  and  zeros  of  transfer  functions  and  many  applications 
involving  continued  fractions)  where  Rutishauser’s  qd  has  been  abandoned  because  of 
its  instability  in  the  general  case. 

Our  error  bounds  for  singular  values  are  significantly  smaller  than  those  in  DK  and 
the  approach  is  quite  transparent.  It  was  this  analysis  that  showed  us  the  possibility 
of  violating  positivity  while  still  maintaining  maximal  relative  accuracy  for  all  singular 
values,  not  just  the  small  ones. 

It  gradually  dawned  on  us  as  we  developed  the  algorithm  that  we  were  breaking  away 
from  the  orthogonal  paradigm  that  has  dominated  the  field  of  matrix  computations  (called 
numerical  linear  algebra  by  highbrows)  since  the  1960 ’s.  It  seems  to  be  sacrilegious  to  be 
achieving  greater  accuracy  and  on  average,  a  four  fold  speed-up^  by  simply  abandoning 
QR  for  something  equivalent  to  LR.  See  Section  9.3  for  details.  High  accuracy  comes 
from  the  fact  that  dqds  spends  most  of  its  time  transforming  lower  triangular  2  X  2s  into 
upper  triangular  2  X  2s  by  premultiplication. 

Rutishauser  gave  no  direct  explanation  for  the  way  shifts  are  introduced  into  qd.  We 
have  supplied  one  in  terms  of  matrix  factorizations  in  Section  5.1  and  go  on  to  list  the 
possible  choices  for  a  shift  in  Section  6  and  9. 

Section  3  presents  the  unifying  general  result  which  shows  that  it  is  possible  to  implement 
the  LR-Cholesky  algorithm  of  Rutishauser  [22],  [26]  using  orthogonal  transformations 
only.  Perhaps  this  is  the  key  idea  exploited  in  the  paper.  Since  the  term  LR-Cholesky 
over  describes  the  algorithm  we  simply  refer  to  it  as  the  Cholesky  Algorithm.  Our 
orthogonal  Cholesky  algorithm  is  applicable  to  dense  matrices;  this  more  general  case  is 
studied  elsewhere  [8]. 

We  want  to  point  out  the  unusual  historical  lineage  of  this  algorithm.  The  qd  algorithm 
begat  the  LR  algorithm  which  then  gave  rise  to  the  QR  algorithm  of  Francis.  This  in  turn 
led  to  the  Golub-Kahan  and  Golub-Reinsch  algorithms  for  singular  values  of  bidiagonal 
matrices  which  lead  to  the  DK  zero-shift  variant.  This  inspired  our  orthogonal  algorithm 
of  which  differential  qd  is  the  root-free  version.  We  are  back  to  qd  again  but  with  a  new 
implement  at  ion . 

As  a  service  to  the  busy  readers  we  have  included  a  brief  account  of  the  origins  of  qd 
and  a  summary  of  the  DK  paper.  When  reading  [25]  we  regretted  that  the  link  between 
continued  fractions  and  our  matrices  was  not  made  explicit.  We  provide  the  connection 


^AU  our  computations  are  performed  on  a  DECstation  5000/120  using  double  precision  arithmetic 
(53-bit  mantissa). 
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in  the  final  section. 


2  Notation  and  Normalization 

This  paper  does  not  involve  vectors  very  much  and  so  we  do  not  follow  Householder 
conventions.  However  capital  roman  letters  denote  matrices  while  lower  case  Roman  and 
Greek  letters  denote  scalars.  On  the  rare  occasions  when  a  vector  is  needed  it  is  denoted 
by  a  lower  case  roman  letter  in  boldface. 

As  usual  the  singidar  values  of  an  n  x  n  matrix  C  are  arranged  in  monotone  decreasing 
order  and  denoted  by  0*2, their  union  is  (t[C]. 


•  We  make  reference  to  the  QR  factorization  of  a  matrix.  This  is  the  matrix  form  of 
the  Gram-Schmidt  orthonormalizing  process  applied  to  the  columns  of  the  matrix 
in  natural  order.  By  convention  the  diagonal  entries  of  the  upper  triangular  factor 
R  are  taken  nonnegative.  See  Golub  and  Van  Loan  [11]  for  details. 

•  We  make  reference  to  the  Cholesky  factorization  of  a  positive  definite  matrix  into 
the  product  of  a  lower  triangular  matrix  and  its  conjugate  transpose.  The  factors 
are  unique. 

•  We  make  references  to  the  LR  and  QR  algorithms.  These  are  defined  in  the  appro¬ 
priate  places. 


We  shall  be  concerned  mainly  with  bidiagonal  matrices  which  we  call  B  and  take  them 
to  be  upper  bidiagonal.  To  save  space  we  write  the  bidiagonal  matrix 


B  = 


ai  61 

(I2  &2 


fin-1  bn-i 
fin 


as 


B  =  hidiag  \ 


b: 


[  fii 


bn-2 


bn-1 


fin-1 
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2.1  Normalization 


Consider  now  the  effect  of  a  zero  value  among  the  parameters  of  n  x  n  bidiagonal  B. 


2.1.1  Superdiagonal 


Suppose  that  bt  =  0,  k  <  n.  Then  B  may  be  written  as  a  direct  (or  diagonal)  sum  of 
two  bidiagonals  J?i  and  B^.  Moreover 

a[B]  =  o[Bi]  U  <r[B2]. 

This  case  makes  the  calculation  of  singular  values  easier.  Even  more  important  is  the 
fact  that  our  algorithms  do  not  suffer  from  the  failure  to  detect  such  a  split  when  it 
occurs.  However,  the  transition  from  a  linearly  convergent  shift  to  a  quadratic  shift  will 
not  occur  if  the  split  lies  undetected  for  too  long. 


2.1.2  Diagonal 


Let  0*  =  0,  A:  <  n.  Since  |  det  B\  =  0"=!  I  N  n"=i  it  follows  that  <t„  =  0.  However 
some  work  is  needed  in  order  to  deflate  this  value,  i.e.  to  find  a  new  B  of  order  n  —  1 
yielding  the  remaining  singular  values  of  B.  In  exact  arithmetic  one  iteration  of  any  of 
the  unshifted  algorithms  given  later  is  guaranteed  to  produce  the  desired  B  and  so  this 
case  does  not  need  special  treatment.  The  zero  diagonal  entry  may  be  driven  to  the 
closest  end  of  the  matrix. 

If  a*  =  0,  A:  <  n,  at  one  step  of  our  algorithm  and  if  a„  =  0  at  the  next  step  then 
will  also  vanish  and  so  produce  a  split  into  two  bidiagonals. 


2.1.3  Signs 


If  the  matrix  is  real,  then  using  pre  and  post  multiplication  by  matrices  of  the  form 
diag{±.\}  any  sign  pattern  may  be  imposed  on  the  entries  of  B  without  changing  the 
singular  values.  If  the  matrix  is  complex,  then  it  could  be  transformed  to  a  real  matrix 
by  pre  and  post  multiplication  by  matrices  of  the  form  diag{exp(iu)}  where  =  —1  and 
u>  is  real. 

There  is  little  loss  of  generality  in  assuming,  when  necessary,  that  B  is  of  real  positive 
type;  all  its  parameters  exceed  0.  However  in  Section  5.3  we  address  the  practical  question 
of  when  to  relax  the  requirement  of  positivity. 
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3  Orthogonal  Form  of  the  Cholesky  Algorithm 


The  result  given  in  the  theorem  below  is  implicit  in  proofs  that  one  step  of  the  QR 
algorithm  is  equal  to  two  steps  of  the  Cholesky  algorithm.  Nevertheless  it  appears  not 
to  have  been  stated  explicitly  before  and  was  not  known  to  several  experts  whom  we 
consulted.  So  for  the  next  few  paragraphs  we  consider  full  complex  matrices.  Recall 
that  the  Cholesky  factorization  of  a  positive  definite  Hermitian  matrix  i4(=  A*)  may  be 
written  as  A  =  LL*  where  L  is  lower  triangular. 

Definition.  The  Cholesky  transform  of  A  =  LL*  is 

A  :=  L*L 


The  Cholesky  algorithm,  consisting  of  successive  applications  of  the  Cholesky  transfor¬ 
mation,  is  a  special  case  of  the  LR  algorithm. 

We  now  consider  the  relation  between  L  and  L,  the  Cholesky  factors  of  A  and  A,  respec¬ 
tively. 


Theorem  1  Let  A  =  LL*  be  the  Cholesky  factorization  of  the  Cholesky  transform  of 
positive  definite  A  =  LL*.  Then 


L  =  QL* 


is  the  QR  factorization  of  L. 
Some  may  prefer  the  formulation 

with  A  =  R*R  and  A  =  R*R. 


R*  =  QR 


Proof.  Since  A  is  positive  definite  all  factors  mentioned  below  are  unique.  By  definition 
of  Z 

L*L  =  a*. 


We  seek  invertible  F  such  that 

L  =  Ft, 


(1) 


L*  =  LF-\  (2) 

Transpose  and  conjugate  (1)  and  use  invertibility  of  L  in  (2)  to  find 

F*  =  L-^L*  =  F-K 

So  F  is  unitary  and  since  t  is  upper  triangular  with  positive  diagonal  Equation  (1) 
above  gives  the  QR  factorization  of  L,  as  claimed.  • 
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The  theorem  shows  that  L  may  be  obtained  from  L  by  orthogonal  transformations  with¬ 
out  forming  A.  Moreover  just  as  QR  may  be  performed  with  column  pivoting  so  can  we 
obtain  the  Cholesky  factor  of  a  permutation  of  A.  A  general  application  of  Theorem  1 
is  presented  in  Fernando  and  Parlett  [8]  but  here  we  return  to  the  bidiagonal  case. 

The  basic  equation  LL*  =  VL  guarantees  that  the  Cholesky  algorithm  preserves  band¬ 
width.  In  particular,  bidiagonal  B  gives  rise  to  tridiagonal  A  —  B*B  and  a  bidiagonal 
B.  In  order  to  study  how  B  is  derived  from  B,  let 


B  =  biding  / 

\  ai  02 


bn-2  bn-l 

On-l  ®n 


B  =  biding  |  ^  . 

^2  •  •  ^n— 1 


^n-2  ^n-1 


where  B*B  BBK  By  Theorem  1 


B*  =  QB. 


The  matrix  Q  may  be  written  as  a  product  of  (n  —  1)  plane  rotation  matrices  [11], 

Q  =  G1G2 . .  .Gn-l* 

Before  the  annihilation  of  the  subdiagonal  element  the  active  part  of  the  matrix  is  of 
the  form, 


0  bk^i 

0  djb  0 

bk  o,k^i  0 

bk-^l  <^k+2 

and  after  the  plane  rotation  GJ.,  the  matrix  becomes 


(3) 


0  6*-! 

0  o*  61; 

0  Ofc+i  0 

i't+l  fflib+2 

Formally  we  may  set  B^°^  =  B*  and,  for  A;  =  1, . . . ,  n  —  1 


(4) 


(5) 
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Finally  B  =  and,  from  (3)  and  (4),  with  Oi  =  oj  and  c\-\-  s\  =  1, 


dk 

=  \/ak+bk  =  ak/ck 

(6) 

Sk 

=  bk/dk 

Ck 

=  dk/dk 

(7) 

bk 

=  ^kOk+l  =  bkOk+lldk 

(8) 

®ib+l 

=  CkOk+i  =  dikOk+iIdk. 

There  is  some  redundancy  in  the  equations  given  above  but  their  most  important  property 
is  the  absence  of  subtractions.  This  ensures  high  relative  accuracy  in  the  new  entries  a,- 
and  hi.  Observe  that  neither  sj,  nor  c*  is  needed  explicitly  to  compute  the  new  entries. 
To  the  best  of  our  knowledge  the  algorithm  given  below  is  new.  For  reasons  that  appear 
in  the  next  section  we  call  it  the  Orthogonal  qd-Algorithm  or  oqd.  It  is  convenient  to 

use  _ 

cabs{x,  y)  =  \/x^  +  (9) 

whose  name  stands  for  the  complex  a6solute  value  of  x  +  iy.  In  numerical  computing 
(e.g.  Eispack),  an  alternative  name  for  cabs  is  pythag. 


Algorithm  1  (oqd) 

a  :=  Oi 

fork  =  1,71  —  1 
djfc  :=  cabs{a,  6*) 
hk  •—  bk  *  Oik) 

a  :=  d*(ajt+i/aib) 
end  for 
d„  :=  a 


This  algorithm  will  undergo  several  transformations  in  the  following  pages  before  we  are 
ready  to  implement  it.  Nevertheless,  even  at  this  stage,  two  applications  of  it  are  slightly 
better  (fewer  multiplications)  than  the  DK  Zero  Shift  QR  algorithm  [4]  described  briefly 
in  our  Section  10. 

The  inner  loop  comparisons  given  in  Table  1  are  based  on  one  QR  step  which  is  equal 
to  two  LR  steps.  We  have  taken  into  account  the  common  sub-expression  ajt+i/djt  in  the 
estimation  of  the  complexity  of  oqd  (Algorithm  1). 

DK  uses  six  auxiliary  variables  while  oqd  needs  only  one.  The  memory  traffic  is  es¬ 
sentially  determined  by  the  number  of  variables,  arithmetic  operations  and  assignment 
statements.  In  most  advanced  architectures,  memory  access  is  more  expensive  than 
floating-point  operations  and  in  such  machines  the  oqd  will  be  very  advantageous  be¬ 
cause  of  fewer  read  and  write  operations. 
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DK 

oqd 

cabs 

2 

1*2 

divisions 

2 

1*2 

multiplications 

6 

2*2 

conditionals 

1 

0 

assignments 

7 

3*2 

auxiliary  variables 

6 

1 

Table  1:  Complexity  of  Demmel-Kahan  and  oqd 

4  The  Quotient  Difference  Algorithm 

It  is  easy  to  avoid  taking  the  square  roots  that  appear  in  oqd  (Algorithm  1)  .  Define 
bn  :=  0  and  q*  =  ,  e*  =  6^  ,  A:  =  1,2, , .  .,n  .  By  simply  squaring  each  assignment 

in  oqd  (Algorithm  1)  one  obtains  an  algorithm  that  turns  out  to  be  a  little  known 
variant  of  the  quotient  difference  algorithm.  Rutishauser  developed  his  qd  algorithm  in 
several  papers  from  1953  or  1954  (e.g.  [20])  until  his  early  death  in  1970  but  this  variant 
appeared  in  English  only  in  1990  in  [25]  which  is  the  translation  of  the  German  original 
[24]  published  in  1976.  The  full  list  of  the  papers  on  qd  by  Rutishauser  can  be  found  in 
the  above  mentioned  books  which  were  published  posthumously. 

In  the  notes  at  the  end  of  [20]  and  at  the  end  of  volume  2  of  [24]  this  variant  is  called 
the  differential  form  of  the  progressive  qd  algorithm  or  dqd.  These  notes  were  based  on 
unfinished  manuscripts  of  Rutishauser. 


Algorithm  2  (dqd) 


d  :=  qi 

fork  :=  l,n  —  1 
gfc  :=  d  +  e* 
h  ek  *  (qk+i/qk) 
d  :=  d  *  (qk+i/qk) 
end  for 
q„  :=  d 


The  implementation  of  dqd  (Algorithm  2)  requires  only  1  division,  2  multiplies,  and  1 
addition  in  the  inner  loop.  No  subtractions  occur. 

The  intermediate  variable  d  may  be  removed.  At  step  k,  d=  dk  and  the  trick  is  to  write 
it  as  a  difference. 


dk+i  =  dlqk+i  =  qk+i  —  slqk+i  —  9i+i  —  cj,. 
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Algorithm  3  (qd) 


eo  =  0 

fork  :=  l,n  —  1 

Qk  •=  (9k  —  ^k-l)  +  ^k 
ik  :=  Cfc  *  qk+i/4k 
end  for 

9n  •”  Qn  ~  ®n  — 1 


Table  2  compares  the  complexity  of  orthogonal,  differential  and  standard  qd  algorithms. 


oqd 

dqd 

qd 

cabs 

1 

0 

0 

divisions 

2 

1 

1 

multiplications 

4 

2 

1 

additions 

1 

1 

1 

subtractions 

0 

0 

1 

assignments 

3 

3 

2 

auxiliary  variables 

1 

1 

0 

Table  2:  Complexity  of  oqd,  dqd  and  qd 


We  hasten  to  add  that  Rntishauser  did  not  derive  the  qd  algorithm  from  our  Theorem  1 
but  from  ideas  described  in  Section  11. 

For  positive  5,  dqd  and  qd  are  stable  in  the  sense  that  all  intermediate  quantities  are 
bounded  by  ||J9|p.  Singular  value  errors  provoked  by  finite  precision  arithmetic  will  be 
tiny  compared  to  <Tj.  This  is  satisfactory  for  many  purposes  and  it  was  not  generally 
appreciated  until  the  DK  paper  appeared  that  bidiagonal  matrices  do  determine  all  their 
singular  values,  however  small,  to  the  same  relative  precision  enjoyed  by  the  matrix 
entries.  Since  such  accuracy  can  be  achieved  for  little  extra  cost  it  seems  only  right  to 
do  so.  These  considerations  lead  us  to  abandon  qd  and  concentrate  on  dqd  and  oqd. 

Example  1  Here  is  a  bidiagonal  Toeplitz  matrix  with  a,-  =  1,  6,-  =  256  (q,-  =  1,  e,-  = 
65536)  for  all  i.  The  results  of  our  dqd  algorithm  are  given  in  Table  3.  Note  that 
y/qcA  =  1.9093060930437717  x  10“^®^  w  2“®°^  gives  correct  to  fuU  machine  precision. 

The  results  for  qd  were  identical  to  dqd  except  that  the  crucial  element  §64  became  zero 
in  both  steps.  Hence  qd  is  not  suitable  for  computation  of  small  singular  values  with 
high  relative  accuracy.  • 

Example  2  We  have  rerun  Example  1  but  with  a  smaller  value  of  (n  =  5)  and  the 
results  are  given  in  Table  4.  For  this  example,  CTs  =  =  2.3282709094019085  x 

10“^°  which  is  correct  to  fuU  machine  precision.  For  comparison,  the  answer  given  by 
the  LINPACK  SVD  routine  dsvdc  (which  is  based  on  the  Golub-Reinsch  algorithm)  is 
2.3282704794711363  x  10“^°  which  gets  7  of  the  15  digits  correct. 
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after  the  first  pass 

after  the  second  pass 

Qi 

92 

93 

94  to  qes 
964 

6.55370000000000000+04 

6.5536000015258556D+04 

6.5536000000000233D+04 

6.5536000000000000D+04 

3.6455053829317361D-304 

6.5537999984741444D+04 

6.5536000061032595D+04 

6.5536000000001397D+04 

6.5536000000000000D+04 

3.6454497569340717D-304 

ei 

€2 

€3 

€4  to  652 

^63 

9.9998474144376459D-01 

9.9999999976717291D-01 

9.9999999999999645D-01 

l.OOOOOOOOOOOOOOOOD+00 

1  .OOOOOOOOOOOOOOOOD+00 

9.9995422572819948D-01 

9.9999999883589297D-01 

9.9999999999997513D-01 

l.OOOOOOOOOOOOOOOOD+00 

5.5625997664363648D-309 

Table  3:  Numerical  results  for  Example  1 


Using  qd  we  got  almost  identical  results  except  that  is  zero  in  both  sweeps.  Thus, 
CTs  is  zero  according  to  the  qd  algorithm.  Thus,  qd  does  not  deliver  as  much  accuracy 
as  Golub-Reinsch;  in  fact  it  can  be  shown  that  qd  sometimes  delivers  zero  for  singular 
values  as  large  as  y/ macheps  *  ||i?|l.  • 


after  the  first  pass 

after  the  second  pass 

91 

92 

93 

94 

95 

6.5537000000000000D+04 

6.5536000015258551D+04 

6.5536000000000238D+04 

6.5536000000000000D+04 

5.4209281443662679D-20 

6.5537999984741449D+04 

6.5536000061032593D+04 

6.5536000000001395D+04 

6.5536000000000000D+04 

5.4208454275671899D-20 

62 

63 

64 

9.9998474144376457D-01 

9.9999999976717293D-01 

9.9999999999999642D-01 

l.OOOOOOOOOOOOOOOOD+00 

9.9995422572819948D-01 

9.9999999883589292D-01 

9.9999999999997509D-01 

8.2716799077854419D-25 

Table  4:  Numerical  results  for  Example  2 


Some  people  do  not  like  root  free  algorithms  (e.g.  dqd)  because  they  limit  the  domain  of 
the  matrices  to  which  they  can  be  applied.  For  example,  a  bidiagonal  B  whose  singular 
values  vary  from  10®°  to  10“®°  could  be  diagonalized  in  single  precision  on  an  IBM 
machine  by  oqd  (Algorithm  1)  but  not  by  dqd  (Algorithm  2)  because  of  overflow  and 
underflow. 


We  conclude  this  section  by  pointing  out  that  qd  (Algorithm  3),  the  standard  qd  algo¬ 
rithm,  consists  of  the  so-called  rhombus  rules  arranged  in  computational  form  and  these 
rules  are  a  direct  consequence  of  the  deflning  equation 

BB*  =  B'B. 


Equate  the  {k,  k)  entry  on  each  side  to  obtain 


al  +  bl  =  bl_,  -K  al 


(10) 
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9k  +  c*  =  et_i  +  Qk. 

and  equate  the  (fc,  A:  +  1)  entry  on  each  side  to  obtain 

hfik+i  =  ^kbk 


(11) 


e*9k+i  =  qk^k- 


The  rhombus  rules  can  be  also  derived  from  =  QB  hy  noting  that  orthogonal  transfor¬ 
mation  Q  changes  neither  the  norms  nor  the  inner  products  of  the  columns.  The  reason 
for  the  name  rhombus  rule  is  indicated  in  Figure  3  of  Section  11. 


5  Incorporation  of  Shifts 


Rutishauser  introduced  shifts  into  the  qd  almost  from  the  beginning  and  we  could  sim¬ 
ply  quote  him.  Unfortunately  he  does  not  give  any  explanation  of  how  he  derived  the 
appropriate  modification  of  qd  (given  in  Section  4).  So  we  provide  one  at  the  end  of 
Section  5.1. 


5.1  Shifted  qd  Algorithms 

In  eigenvalue  calculations,  shifts  are  natural  and  can  be  easily  incorporated  since 

A(A  -  T^I)  =  X{A)  -  r" 

where  is  the  shift  and  A(A)  indicates  an  eigenvalue  of  A.  Thus,  by  subtracting  from 
the  diagonals  of  the  matrix,  we  can  introduce  origin  shifts  into  the  Cholesky  algorithm. 

A  shift  T  can  be  introduced  into  oqd  (Algorithm  1,  Section  3)  by  modifying  statements 
involving  a  and  d. 


d  l —  CLi 

fork  =  l,n  —  1 
CLk  :=  y/a^  +  bl  - 
bk  •=■  bk  *  (a*+i/ak) 
a  yja?  -  *  (ak+i/ok) 

end  for 

a„  :=  —  r* 


Algorithm  4  (oqds) 
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It  may  be  verified  that  B*B  =  BB^  -  To  keep  B  real  the  shift  must  satisfy 

r  <  cr„[J5]  (12) 

but  this  constraint  is  not  formally  necessary  for  dqd  (Algorithm  2)  which  uses 

Qk  '-=  dk  +  Cie  —  • 

Algorithm  5  (dqds) 

d  :=  ft  - 
fork  :=  l,n  —  1 
qk:=d+  e* 
e*  :=  e*  *  (ft+i/ft) 
d:=  d*(qk+i/qk)-T^ 
end  for 
q„  :=  d 


The  constraint  (12)  is  also  unnecessary  for  qd. 


Algorithm  6  (qds) 

eo  =  0 

fork  :=  l,n  —  1 

ft  ;=  (ft  -  ft-i)  +  Ci  - 
ik  :=  e*  *  ft+i/ft 
end  for 

Qn  :=  qn  -  in-l  “ 


All  that  is  lacking  is  an  analogue  of  the  orthogonal  connection  (Theorem  1) 

=  QB. 

For  that  it  is  necessary  to  abandon  square  matrices  and  write 


0 


=  Q 


B 

tI 


The  new  Q  is  2n  x  2n  and  is  not  unique.  However  its  first  n  rows  are  uniquely  determined 
by  B  and  r  for  r  <  cr„[B]. 


It  is  at  this  point  that  the  superiority  of  the  qd  formulation  becomes  clear.  DK  showed 
that  the  standard  Golub-Reinsch  bidiagonal  QR  algorithm  may  be  simplified  when  the 
shift  is  zero;  see  Section  10  for  the  details.  Our  algorithms  (1,2,  or  3)  are  already 
simpler  than  the  DK  zero  shift  QR  and  they  also  permit  use  of  a  non-zero  shift  with  no 
impediment  to  pipelined  or  parallel  implementation  or  high  relative  accuracy.  See  [9]  for 
details.  This  is  strong  evidence  that  our  formulation  is  the  natural  one. 
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5.2  The  Two  Phase  Implementation 


At  first  sight  the  auxiliary  quantities  i  =  that  occur  in  dqd  are  seen  as 

the  price  to  be  paid  for  securing  high  relative  accuracy.  On  further  consideration  they 
may  be  seen  as  an  attractive  feature  that  permits  an  aggressive  shift  strategy  that  also 
preserves  high  relative  accuracy  in  the  computed  singular  values.  Moreover,  as  an  extra 
bonus,  we  find  that  the  vector  d=  (dj, . . . , d„)  may  be  computed  in  C?(log2  n)  steps  in  a 
parallel  computer  using  the  technique  called  parallel  prefix  operation  in  computer  science 
writings,  see  [3]. 


Consider  next  the  implementation  of  dqds.  The  auxiliary  quantities  di  may  be  computed 
prior  to  any  modification  of  q  and  e  since 


=  dkQk+lMk—T^ 

=  <ft9ii+i/(djb  +  Ct)  —  r^. 


An  alternative  formulation  is 

dk+i 


gfc+i 

1  +  Ck/dk 


but  a  division  costs  more  than  a  multiplication. 


(13) 

(14) 


It  is  at  this  point  that  one  sees  the  advantage  of  arithmetic  units  that  conform  to  the 
ANSI/IEEE  floating  point  standard  754:  there  is  no  need  to  test  at  each  instance  of  (13) 
or  (14)  to  prevent  division  by  zero.  The  occurence  of  a  A:  with  d*  =  oo  does  no  harm.  It 
signals  that 


and  the  transformation  of  B  to  B  (Phase  2)  should  not  be  completed.  The  effort  in 
running  (13)  is  not  wasted  because  it  yields  a  new  upper  bound  on  CTnlB], 

Using  (13),  d*  =  oo  yields  djt+i  =  oo/oo  =  NaN  (not  a  number)  and  then  q,-  =  NaN  for 
*  >  ^  Using  (14),  djt  =  oo  yields  dfc+i  =  ~  which  is  a  better  answer. 


5.3  Almost  Positive  Bidiagonals 

The  standard  qd  algorithm  is  well  defined  for  most  shifts  but  it  may  not  be  stable  in 
an  absolute  sense;  i.e.  the  new  array  {q,e}  may  be  far  greater  than  old  one  {q,e}, 
Rutishauser  proved  stability  under  the  assumption  of  positivity  and  took  great  care  in 
his  implementation  to  preserve  this  property. 

Our  dqds  algorithm  has  the  advantage  of  maintaining  relative  stability  in  the  positive 
case  and,  fortunately,  even  beyond.  We  currently  impose  the  requirement 

+  e„_i 


14 


where  5„_i  is  the  leading  principal  submatrix  of  because  it  ensures  that  the  only 
entries  in  {q,  e}  that  could  go  negative  are  e„_i  and  g„.  Our  goal  is  to  choose  r  (actually 
r^)  to  make  as  small  as  possible  and  hence 

T*  »  dn  =  9n(l  -  e„_i/9„_i). 

Notice  how  strongly  d„  depends  on  sign{en-i)  and  sign(qn)  since  ^n-i?  though  unknown, 
remains  positive.  There  are  four  possible  configurations  in  the  asymptotic  regime  (r^  < 
|d„_i+e„_i)  and  we  designate  them  by  sign  pairs:  (si</n(e„_i),  sign(q„)).  Each  time  that 
dqds  is  invoked  there  is  no  doubt  about  sign{en-i)  but  sign(q„)  will  not  be  predictable 
since  the  aim  is  to  have  =  0. 

A  careful  study  of  the  last  three  assignments  in  dqds  shows  the  following  possible  paths 
the  iteration  could  follow.  Since  we  do  not  expect  more  than  2  steps  before  convergence 
(and  deflation)  some  edges  may  not  be  traversed. 


If  t2  <  a„ 
(+»+) 
(+,-) 
(-,+) 

If  t2  >  <Tn 

(  +  5+) 
(+,-) 
(- +) 
(- -) 


6  Bounds  for 

6.1  A  Posteriori  Bounds  for  the  Smallest  Singular  Value 

Our  oqd(Algorithm  1  in  Section  3)  transforms  5  to  P  by  making  use  of  n  auxiliary 
quantities  dk,k  =  l,n.  It  is  possible  to  give  a  nice  interpretation  of  the  dk  that  leads  to 
useful  bounds  on  <Tm<n*  This  result  was  also  obtained  by  Rutishauser  but  his  treatment 
was  not  based  on  orthogonal  rotations  although  he  knew  the  matrix  interpretation  of  qd. 

If  we  think  of  the  matrix  B*  being  transformed  into  B  one  column  at  a  time  in  (n  — 
1)  little  steps  then  at  the  end  of  Step  (k  —  1)  row  A:  is  a  singleton.  That  is  the  key 
technical  observation.  To  describe  the  situation  we  refer  back  to  Section  3  and  let  Q*  = 


(-  -) 
(“»+) 


— (+>+) 

—  (-  +) 

— 

— (+>+) 
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(G1G2  . . .  Gk^iY  product  of  the  first  (k  —  1)  plane  rotations  used  in  the  reduction 

process.  Thus 


=  Q^B* 


61 

0 


61 

O2  ^2 
0  . 

0  aib_i 
0 


h-i 

h  0 

bk  Ot+i  0 
bk+l  Oi+2 

0 

bn-2  Ofi-l  0 
^n— 1  ®r» 


(15) 


Note  that  QkB*  coincides  with  B  in  rows  1, 2, . . . ,  A:  —  1  and  with  B*  in  rows  fc  + 1, . . . ,  n 
while  orthogonal  Qk  coincides  with  I„  in  rows  fc  +  1, . . . ,  n. 


Theorem  2  (Bounds  for  Omin  without  shifts)  Apply  the  dqd  transformation  to  a 
positive  bidiagonal  B  (see  Algorithm  1)  to  produce  B  and  di, d2, . . . ,  o„.  Then 


1.  <7„  <  mint{afc} 

[{BBT%.k  =  ar" 

Proof:  Since  singular  values  are  invariant  under  orthogonal  transformations  and  trans-  . 
position 

crn[B]  =  <Tn[QkB*]  <  ||w3^(5t5‘||  =  d* 

where  Uk  is  the  fcth  column  of  the  identity  matrix.  The  fcth  row  of  QkB*  is  a  singleton; 

u*f.QkB*  =  djbuj,. 

Transposing  and  rearranging  gives 

ak^QkUk  =  B~^Uk 
d,-2  =  {B-*B-%,k 

as  claimed.  Note  that 

cr-2  <  =  ||£“^||^  =  trace[{BB*)-^]  =  ^d.r^ 

«=i  »=i 

Finally  we  get  the  required  res'ilt  by  considering  the  one  and  two  norms  of  the  vector 

(dr\d2\...d-^).  • 
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We  can  compute  bounds  on  even  when  the  algorithm  is  used  with  shifts  r  pro¬ 

vided  that  r  <  <Tmin[B].  Formally  the  reduction 


requires  2(n  —  1)  plane  rotations  (not  just  n  —  1)  because  the  rotation  Gi  in  (t,  i  +  1) 
must  be  preceded  by  a  rotation  Gi  in  plane  (i,  n  -f  i)  in  order  to  introduce  r  into  position 
(n  -I-  i,  i).  Thus  the  rows  1, . . . ,  fc  -  1  and  n-|-l,  ...,n-|-fc-lof 

<^‘(0)  -  (fO 

are  coincident.  Also  the  rows  A:  +  1, . . . ,  n  and  n  +  fc, .  • . ,  2n  of 


are  the  same.  However  row  k  is  still  a  singleton  in  fa.ct 


Theorem  3  (Bounds  for  CTmin  with  shifts)  If  the  dqds  algorithm  with  shift  r  trans¬ 
forms  positive  bidiagonal  B  into  positive  B  with  auxiliary  quantities  Si, . .  .,d„  then 

1.  (T„  <  minfc{at} 

i 

where,  in  (16),  u^Qk  :=  ^  V  have  n  entries. 


Proof:  Since  singular  values  are  invariant  under  orthogonal  transformation  and  transpo¬ 
sition, 

OnlB]  =  Qi  ^  ^  ^  ^  ^  II  =  a*. 

The  last  equality  uses  (16).  To  establish  the  second  result  transpose  (16)  to  obtain 

^  B  0  Qluk  =  Bxk  == 


Ukttk. 


Since  B  is  invertible, 


a^^Xk  =  B  ^Uk, 


Remark:  Since  the  d}.  are  monotone  decreasing  in  r  a  successful  dqds  transformation 
produces  a  better  upper  bound  and  a  worse  lower  bound  than  does  dqd.  Fortunately  it 
is  the  upper  bound  that  plays  an  active  role  in  our  implementation. 
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6,2  The  Newton  shift 


The  shift  strategy  used  by  Bauer  to  accelerate  the  rational  QR  algorithm  RATQR  is  also 
closely  related  to  part  3  of  the  above  theorem.  See  [1],  [19]. 

We  recall  that  the  Newton  shift  from  0  for  the  characteristic  polynomial  of  any  matrix 
A  is  related  to  the  trace  of  the  inverse.  Let 

XA{t)  =  det[tl  -A]  =  f[(t  -  A,). 

t=i 

Then,  by  logarithmic  differentiation 


In  particular 

n 

=  —  ]^A“^  =  —traceA~^ 

t=i 


Xa(0) 

X/i(0) 


because  the  spectrum  of  A  ^  is  {A^ 


In  our  case  is  the  Newton  correction  from  0  towards 


6.3  The  (l,oo)  Bound 


The  DK  paper  also  provides  lower  bounds  on  <t„.  Two  recurrences  (see  Section  10  for 
details)  produce 

min  A,-  = 


and 

vainnj  = 

Then 

Since  ||C||  <  for  any  square  matrix  C,  we  can  improve  the  DK  bound  to 

give, 
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6.4  The  Johnson  Bound 


For  a  general  complex  matrix  C,  a  Gersgorin-type  bound  for  (Tm,„  is  given  by  Johnson 
(see  [14]), 

Omin  >  max{0, 0} 

where 

=  nun  I  ^  X) 

‘  (  ^  k^i 

For  a  positive  bidiagonal  B,  this  simplifies  to 

9  =  min  |c,-  -  i(6,-  +  6i_i)| 

and  ultimately  this  becomes 

0  =  an- 


7  Effects  of  Finite  Precision 

7.1  Error  Analysis  -  Overview 

One  of  the  benefits  of  the  simplicity  of  our  algorithms  oqd  and  dqd  is  that  their  anal¬ 
ysis  is  relatively  easy.  The  DK  zero  shift  QR  transformation,  though  simpler  than  the 
Golub/Reinsch  transformation,  is  complicated  enough  to  defy  anything  but  a  forward 
error  analysis.  After  heroic  struggles  with  innumerable  details  DK  establish  the  error 
bound  quoted  in  Section  10.4. 

When  discussing  this  result  and  our  own  analyses  it  is  convenient  to  use  the  acronym 
ulp  which  stands  for  units  in  the  iast  place  held.  It  is  the  natural  way  to  refer  to  relative 
differences  between  numbers.  When  a  result  is  correctly  rounded  the  error  is  not  more 
than  half  an  ulp.  In  this  section  we  usually  omit  the  ubiquitous  phrase  ‘at  most’  qualifying 
errors  and  modifications. 

Our  algorithms  still  do  not  admit  a  pure  backward  error  analysis,  the  computed  output 
B  is  not  the  exact  output  from  a  matrix  very  close  to  B.  Nevertheless  we  can  use  a 
hybrid  interpretation  involving  both  backward  and  forward  interpretation. 

Whereas  DK’s  zero  shift  guarantees  that  each  computed  singular  value  is  in  error  by  no 
more  than  69n^  nips  our  dqds  algorithm  causes  no  more  than  An  u/ps  change  using  any 
properly  chosen  shift.  However  the  main  point  is  that  our  analysis  is  easy  to  grasp. 

The  next  subsection  establishes  this  strong  property  of  dqds.  A  similar  result  holds  for 
oqds  but  the  square  roots  and  squaring  provoke  a  slightly  larger  bound. 
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The  trick  of  the  proof  is  to  define  B  so  that  the  computed  auxiliary  quantities  {d,}  are 
exact  outputs  of  dqds.  The  difference  between  B  and  B  is  the  forward  error. 

At  the  beginning  of  the  paper  we  made  much  of  the  fact  that  algorithms  oqd  and  dqd 
required  no  subtractions.  Yet,  in  the  interest  of  efficiency,  we  have  introduced  shifts  and 
quietly  brought  back  subtraction.  The  nairacle  is  that  the  subtraction  is  in  the  d’s  and 
does  not  impair  the  high  relative  accuracy  property.  However  qd  does  not  guarantee 
high  relative  accuracy  so  long  as  q’s  are  dominated  by  neighbouring  e’s. 

Since  no  intermediate  quantities  exceed  ai,  it  is  assumed  that  the  initial  data  are  scaled 
so  that  (Ti  (or  erf  for  dqds)  is  close  to  the  overflow  threshold.  Underflow,  though  possible, 
is  then  a  rare  event. 

Finally  we  remind  the  reader  that  the  symbol  =  carries  its  normal  mathematical 
meaning. 


7.2  High  Relative  Accuracy  in  the  Presence  of  Shifts 


We  refer  the  reader  to  Section  5.3  where  almost  positive  bidiagonals  are  introduced. 
Rutishauser  merges  the  g’s  and  c’s  into  a  single  array  Z; 

and  this  is  a  convenient  notation  for  the  analysis  which  follows. 

Before  stating  our  claim  we  need  more  notation  because  the  difficulty  in  the  analysis  is  one 
of  interpretation.  Given  Z  the  dqds  algorithm  in  finite  precision  arithmetic  produces 
representable  output  Z.  We  introduce  two  ideal  arrays  Z  and  Z  such  that  Z  is  the 
output  of  dqds  with  shift  r  acting  on  Z  in  exact  arithmetic.  Moreover  Z  is  a  small 
relative  perturbation  of  Z  and  Z  is  a  small  relative  perturbation  of  Z.  See  Figure  1. 

Our  model  of  arithmetic  is  that  the  floating  point  result  of  a  basic  arithmetic  operation 
o  satisfies 

fl{x  0  p)  =  (a:  0  p)(l  +  =  (a:  0  p)/(H-  6)  (17) 

where  77  and  6  depend  on  x,  y,  and  o,  and  the  arithmetic  unit  but  satisfy 

I77I  <  6  ,  1^1  <  6 

for  a  given  e  that  depends  only  on  the  arithmetic  unit.  We  shall  choose  freely  the  form 
(»7  or  6)  that  suits  the  analysis. 


A  fairly  simple  result  is  possible  because  the  only  truly  sequential  part  of  dqds  is  the 
sequence  {</<}”.  Note  that,  in  exact  arithmetic 


djj+i  — 


^kQk+l 
dk  +  Ck 


-  r‘ 
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dqds 

computed 


change  each 
qk  by  3  ulp& 

€k  by  1  ulp 

^  dqds 

exact 


Z 


change  each 
Qk,  h  by  2  ulps 


Figure  1:  Effects  of  roundoff 

The  trick  is  to  write  down  the  relations  governing  the  computed  quantities  and  then 
to  discern  among  them  an  exact  dqds  transform  whose  input  is  close  to  Z  and  whose 
output  is  close  to  Z. 


Theorem  4  In  the  absence  of  underflow  or  overflow,  the  Z  diagram  given  above  com¬ 
mutes  and  qk  (ck)  differs  from  qk  (ek)  by  3  (1)  ulps,  qk  (ik)  differrs  from  qk  (ck)  by  2 
(2)  ulps. 


Proof:  We  write  down  the  exact  relations  satisfied  by  the  computed  quantities  Z, 
qk  =  (4 +  et)/(l  +  e+) 


4  = 


ek  = 


^Jb+l  — 


e.i.(l  +  (.)  =  ^><w(l  +  C/)(l  +  M(l  +  <.) 

dk  +  Ck 

{dktkjl  4-  €★)  -  T^} 

1  +  Ci+l 


(18) 

(19) 

(20) 


Note  the  difference  between  ♦  and  ★.  Of  course  all  the  e’s  obey  (17)  and  depend  on  k  but 
we  have  chosen  to  single  out  the  one  that  accounts  for  the  subtraction  because  it  is  the 
only  one  where  the  k  dependence  must  be  made  explicit.  In  more  detail  the  last  relation 


IS 


dk  +  Ck 


(1  +  6/)(l  +  6+)(l  +  e*) 


r 


2 


(1  +  ek^i)dk+i 
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(1  +  ek)dkqk+i{l  +  e/)(l  +  c+)(l  +  e^)  _ 

(1  +  ^k)d’k  +  (1  +  (k)^k 

This  tells  us  how  to  define  Z.  Note  that  €k  arose  in  the  previous  step.  Moreover 


(l4-€i)di  =qi-T^ 

(22) 

Our  choice  of  Z,  in  general,  is  not  a  machine  representable  array. 

For  k>\. 

dk  :=  (1  +  ik)dk 

e*  :=  (l  +  ei)et 

Si+i  :=  9fc+i(l  +  €/)(l  +  c+)(l  +  c*)  >  (91  =  9i) 

(23) 

(24) 

and  by  (21), 

j*  d^qk+i  _2 

Oib  +  l  =  y  ^  T 

dk  +  Cfc 

Then,  for  exact  dqds,  we  must  define 

(25) 

Qk  dk  +  ek  =  {1  +  €k){dk  +  Ck) 

w  ^kQk+l 

ek  • 

dk  +  c* 

Finally  qt  and  e*  must  be  recast  in  terms  of  Z; 

9*  =  9i/(l  +  €*)(!  +  c+)  ,  from  (18) 

(26) 

c*  =  .  (l  +  €/)(l  +  e+)(l  +  €.)  ,  from  (20) 

Ojb  +  ^k 

__  ^*^+1  (1  +  ^*) 

{dk  +  Ck)  (1  +  ^) 

(27) 

It  is  (23)  that  yields  +  =  €k/{dk  +  €k)-  Equations  (23)  and  (24)  give  the  change 

from  Z  to  and  equation  (25)  fixes  the  exact  dqds  transform.  • 


Recall  that,  in  exact  arithmetic,  algorithm  dqds  diminishes  all  eigenvalues  (of  LR)  by 
the  shift.  For  finite  precision  execution  we  have  the  following. 


Corollary  1  Algorithm  dqds  preserves  high  relative  stability.  When  Z  and  Z  are  posi¬ 
tive  then  the  associated  bidiagonal  matrices  B  and  B  (a,-  =  6,-  =  y/Fi,  etc.),  together 

with  the  associated  ideal  bidiagonals  B  and  B,  satisfy 

ai[B]  =  ai[B]  exp  {2(71  -  l)£i}, 

aflB]  =  <7f[B]  -  r^ 

(Ti[B]  =  (Ti[B]  exp  {(2n  -  l)e2}, 
for  i=  1, 2, . . , ,  n,  and  Cj  <  e,  e2  <  f  • 
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Proof;  For  i  =  1, . . . ,  n  —  1 

Si+i  =  ^Qi+i  =  a<+i'y/(l  +  c/)(l  +  c+)(l  +  f*) 
bi  =  \/^  =  ^*'^(1  H"  f«)' 

By  Theorem  2  in  DK,  the  relative  change  in  any  singular  value  in  going  from  B  to  B  is 
the  product  of  all  the  relative  changes,  namely 

jj [(!  +  €/)(!  + €+)(!  + €*)(!  + e,)]i  <exp2(n-  l)e. 

Similarly 

=  V¥i  =  «»/ ^(1  +  ei)(l  +  ^+)  ^  i  <  n 
bi-y/Fi  =  6,Y(l  +  e.)(l  +  €*)  ,  i<n 

Otfi  —  y/dn  ”  "i”  ^n)* 

The  relative  change  in  any  singular  value  in  the  transformation  from  B  to  B  is  bounded 
by 

Vl  +  Cn  IlKl  +  c.)/(l  +  c*)(l  +  €.)(1  +  €+)]i  <  exp  (4n  -  3)c/2. 

izzl 

Since  the  passage  from  5  to  ^  is  exact  the  singular  values  diminish  by  • 

Remark:  It  can  be  shown  by  similar  means  that  one  dqd  transformation  cannot  alter 
any  singular  value  by  more  than  3(n  —  1)  ulps. 

Theorem  4  is  much  stronger  than  the  familiar  error  analysis  based  on  norms  because: 


1.  The  perturbed  matrices  considered  here  inherit  the  bidiagonal  structure 

2.  The  bounds  are  very  much  smaller  than  those  from  DK  (see  Section  10)  or  the 
Golub- Reinsch  algorithm  (see  Chapter  8  of  [11]). 


For  multiple  sweeps  of  dqds,  our  results  can  be  stated  more  simply  in  terms  of  the 
positive  sequence  {Zi}  where  /  denotes  the  sweep  with  =  Zi  =  Z.  See  Figure  2  for  the 
corresponding  commutative  diagram.  Then  by  combining  Zi  — ►  Zj^i  one  obtains 

that  Zi^i  is  the  exact  dqds  transformation  of  a  perturbed  (in  relative  sense)  Zi,  Thus 
backward  stability  is  present  for  {Zj}. 

Similarly  it  can  be  shown  that  the  sequence  {Zj}  is  forward  stable  (in  relative  sense)  with 
Zf  =  Zf  where  Zf  is  the  final  computed  result.  On  exact  application  of  dqds  on  Zi  we 
get  Zi+i  instead  of  Zi+i  (see  Figure  2)  and  the  error  between  Zi^i  and  Zj+i  is  small. 

Example  3  The  following  experiment  shows  vividly  the  difference  between  an  algorithm 
that  obtains  high  relative  accuracy  (dqds)  and  one  that  does  not  (LINPACK  dsvdc 
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Zi^i - ►  Zi^2 


change  each 
Qk  by  3  ulps 
€k  by  1  ulp 

rr  dqds  7  'y  'j 

L\  - ►  - ►  Zr/+2 - ►  -^/+3 

computed 


change  each 
ejb  by  2  u/ps 

fy  d<jds  y  y  y 

/j\  - -  2//+1  L\j^2 - - ►  ^/+3 

exact 


Figure  2:  Effects  of  roundoff  for  multiple  sweeps 


based  on  the  Golub-Reinsch  algorithm)  but  which  delivers  excellent  absolute  accuracy. 
We  took  the  graded  matrix  5+, 


Q/x  —  \  9  “■  U'j 

with  a„  =  1,  n  =  8  and  /?  =  60.  We  applied  both  algorithms  to  and  its  reversal 

ai  ,  bi  bn^i. 

We  did  not  allow  any  flipping  of  the  matrix  within  the  dqds  algorithm  although  such 
flipping  improves  convergence.  See  next  section. 

In  Tables  5  and  6,  the  third  column  shows,  absi^ 

absi  :=  (a<[£_]  -  a,[B+])/<7i[5+] 

the  differences  between  outputs  scaled  by  the  2-norm  of  the  nicer  matrix.  Recall  that 
macheps  «  2"^^  «  1.1  X  10“^®.  For  dsvdc  (see  Table  5),  it  can  be  seen  that  the  absolute 
error  is  even  smaller  than  absolute  stability  guarantees. 

In  Tables  5  and  6,  the  fourth  column  shows,  re/,*, 


rdi  :=  (£r,[fi_]  -  (Tj[J3+])/<7,[5+] 
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the  relative  differences  in  the  outputs.  For  dqds  the  largest  magnitude  is  less  than  two 
macheps  (see  Table  5)  while  for  dsvdc  (see  Table  6),  it  is  very  much  larger  and  shows 
that  dsvdc  does  not  give  relative  accuracy. 


rr 

a.[£_] 

absi 

reli 

3.9590303657774160D+12 

3.9590303657774155D+ 12 

-1.2D-16 

-1.2D-16 

5.7143240472800255D+10 

5.7143240472800247D+10 

-1.9D-18 

-1.3D-16 

8.9790986853271568D+08 

8.9790986853271568D+08 

O.OD+00 

O.OD+00 

1.4489876544914651D+07 

1.4489876544914651D+07 

O.OD+00 

O.OD+00 

5 

2.3661793507020348D+05 

2.3661793507020348D+05 

O.OD+00 

O.OD+00 

6 

3 .8884661685208386D +03 

3.8884661685208386D+03 

-l.lD-25 

-1.2D-16 

6.4142972113704085D+01 

6.4142972113704109D+01 

3.6D-27 

2.2D-16 

3.5351579203702068D-01 

3.5351579203702068D-01 

O.OD+00 

O.OD+00 

Table  5:  Numerical  results  using  dqds  for  Example  3 


i 

absi 

reli 

1 

3.9590303657774146D+12 

3.9590303657774160D+12 

3.7D-16 

3.7D-16 

2 

5.7143240472800224D+10 

5.7143240472800278D+10 

1.3D-17 

9.3D-16 

3 

8.9790986853271544D+08 

8.9790986853271413D+08 

-3.3D-19 

-1.5D-15 

4 

1.4489876544914653D+07 

1.4489876544914989D+07 

8.5D-20 

2.3D-14 

5 

2.3661793507020345D+05 

2.3661793507022545D+05 

5.6D-21 

9.3D-14 

6 

3.8884661685208386D+03 

3.8884661685173243D+03 

-8.9D-22 

-9.0D-13 

7 

6.4142972113704073D+01 

6.4142972113772929D+01 

1.7D-23 

l.lD-12 

8 

3.5351579203702068D-01 

3.5351579205582154D-01 

4.7D-24 

5.3D-11 

Table  6:  Numerical  results  using  dsvdc  for  Example  3 


8  Convergence 

8,1  Linear  Convergence 

Convergence  of  the  Cholesky  algorithm  and  of  the  standard  qd  algorithm  for  tridiagonal 
matrices  (in  the  positive  case)  have  been  given  by  Rutishauser  and  others  (see  [22],  [27]). 
Nevertheless  we  present  here  a  direct  convergence  proof  for  oqd  because  it  is  a  new 
algorithm  and  the  proof  is  both  short  and  illuminating.  We  give  a  brief  discussion  of  the 
effects  of  finite  precision  on  the  results  of  the  theorem  at  the  end  of  the  section. 

The  point  is  that  a  computed  cosine  can  equal  unity  even  though  the  corresponding 
sine  Sk  may  merely  be  small. 

Our  proof  makes  use  of  the  fact  that  a  symmetric  tridiagonal  matrix  with  nonzero  super- 
diagonal  entries  (e.g.  B^B)  cannot  have  multiple  eigenvalues  [18],  [27].  In  other  words, 
the  singular  values  of  a  positive  bidiagonal  B  are  distinct 


ai  >  ^2  >  . . .  > 


(28) 
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One  must  bear  in  mind  that  distinct  a,-  may  be  equal  to  working  precision. 


Theorem  5  (Convergence  of  oqd)  From  a  positive  bidiagonal  Bi  the  unshifted  oqd 
algorithm  produces  a  sequences  of  orthogonally  equivalent  bidiagonals.  Asl  oo, 

Bi-*E  =  diag{ai, . . .  ,<t„) 

Furthermore,  if  a„  <  ai,  then  the  sequence  monotone  decreasing  in  I  from 

the  beginning.  Each  converges  linearly  to  0  with  convergence  factor  Oi+i/ai. 


Proof:  Consider  a  typical  step  of  oqd  (Algorithm  1).  Since  there  are  no  subtractions 
each  Bi  is  positive  since  Bi  is  positive. 

Equation  (7)  can  be  written  in  the  form, 

dfc  =  Cfc_iOit/ct  for  fc  =  l,...,n  (29) 

provided  that  we  set  Co  =  c„  =  1  .  Take  the  product  of  the  first  k  instances  of  (29)  to 
find 

k  k  k 

na.  =  (na,)/ct>na..  (30) 

»=1  t=l  1=1 

The  sequence  {HLi  bounded  above,  by  ||fi||*,  and  monotone  increasing  by  (30) 

and  thus  convergent.  The  limit  may  be  written  flfsi  fo  reveal  that,  as  I  —>■  oo, 


aP  Pi 

(31) 

ci')  -  1 

(32) 

4'^  ->  0. 

(33) 

By  (8)  and  (33),  as  1  -*  00, 

=  4'^4+i  ^0,  i=  l,...,n-l 

(34) 

Thus  Bi  converges  to  diagonal  form  and  each  pi  is  a  singular  value.  To  identify  pi  use 
the  product  rhombus  rule  to  find 

^  =  ®i+i Mt+i/M**  (35) 

Since  {5®}  is  bounded  (by  |l£i||)  the  limit  in  (35)  cannot  exceed  1.  By  (28)  the  limit  is 
not  1  and  thus  pk+i  <  pt,  k  =  1, . .  .,n  —  1,  and  we  may  identify  pt  as  <7*.  Thus,  (35) 
proves  that  b^P  — »•  0  linearly  with  convergence  factor  <Tk+if(Tki  as  claimed,  and  Bi  —>■  E. 

Finally  consider  0?^/  Apply  (8)  and  (6)  in  turn  to  find 


6i . . .6„_ir 


0,2  •  •  •  Ofi 


Oi 


.a. 


n~l 


,  ,  C1C2C3 

C>i  .  .  .  Ofi^i  ...  o„ 

Oi  Cl  C2  ^n— 2 

bi . .  .6„_i, 


< 
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if  Un  <0,1-  Since  B  may  be  flipped  about  its  antidiagonal  without  altering  the  singular 
values  there  is  no  loss  of  generality  in  assuming  that  an  <  ai.  In  this  case  is 

monotone  decreasing  with  I  from  the  start.  ♦ 

It  is  worth  mentioning  that  Rutishauser’s  proof  of  convergence  for  qd  (Algorithm  3),  is 
based  on  the  observation  that  ||5(fc)||jp  >  l|5jb||F  and  <  |lJ3-jbl|F5  k  = 

Here  M{±k}  is  the  submatrix  of  M  containing  the  first  (last)  k  columns. 

By  taking  the  product  of  the  final  n  —  k  instances  of  (29)  one  finds  that  {fl  is 

monotone  decreasing  in  /  for  A:  =  n  —  1,  n  —  2, . . . ,  2, 1,  In  particular  increases  and 
decreases  so  that  a  flipping  of  B  is  needed  at  most  once. 

The  assertions  of  Theorem  5  bear  up  quite  well  in  finite  precision  arithmetic.  The  se¬ 
quence  {nLi  is  monotone  nondecreasing  and  so  ^  1  as  /  oo.  Thus  in 
considering  a  sequence  of  computed  trigonometric  values  we  do  not  wish  to  infer  0 

from  1.  So  the  first  casualty  is  the  conclusion  that  s^P  — 0.  Instead  we  find  that 

becomes  negligible  relative  to  and  Even  so,  in  the  absence  of  underflow, 
the  diagonal  entries  eventually  rearrange  themselves  in  (almost)  monotone  nonincreasing 
order.  Though  distinct,  by  (28),  some  singular  values  may  be  equal  to  working  accuracy 
and  diagonal  monotonicity  may  actually  fail  by  one  or  two  nips  (units  in  the  last  place 
held)  because  the  ratio,  though  exceeding  1  might  be  too  small  to  cause  the  neighbouring 
6- value  to  grow  at  all.  All  in  all  the  practical  oqd  performs  as  closely  to  exact  oqd  as  it 
is  reasonable  to  expect. 


8.2  Quadratic  Convergence 


Consider  the  last  few  steps  in  dqds  with  shift  r: 

ffn-l  ==  dn-l  +  en^l 

6yi_i  ~  — 1 

dn  — 

Qn  —  dn* 


Hence 


js  _  ^n-lQn  _  /I  ^2! 

n-lQn  —  ^  9 

9n— 1  9n  — 1  J 


^n— l?n 

9n-l  L 


Qn-l 


(36) 


In  exact  arithmetic,  as  r  ^  (^n\Bn\  we  have  0,  e„_i  — »■  0,  q„_i  — — 

ct^[Bn]  :=  gap  >  0  because  the  singular  values  are  distinct  if  the  initial  B  is  of  positive 
type.  Thus  convergence  will  be  quadratic  with  respect  to  this  gap. 
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Expression  (36)  shows  that  if 


0  <  9„ 


-r^  <  2 


9n^n-l 

9n-l 


(37) 


then 


and  so,  by  (36), 


kn-T 


2 


1  ^  9n^n— 1 

gn-1  "  gn-l 


|en-lgn| 

(e„_ig„)2  ^,^^2  ’ 


as  T  — »•  an\Bn\-  Thus  (37)  shows  a  (theoretical)  interval  for  those  r  that  deliver  quadratic 
convergence.  Next  we  seek  a  usable  expression  that  will  ultimately  lie  in  that  interval. 


The  perfect  shift  is 

=  g„(l  -  f^) 

9n-l 

SO  that  the  natural  strategy  is  to  estimate  from 

^  /i  ^n— 2\  2 

gn-1  =  (1  -  -r - )gn-l  -  -  en-l 

Qn-2 

«  (l_^)q„_„ 

9n-2 

when  c„_2  and  e„_i  are  small  enough.  We  may  assume  that  n  >  2  and  so  may  use 

=  qn*{l-  Cn-i/qn-i  *  max{i,l  -  e„_2/9n-2})  (38) 

provided  that  0  <  d„_i  =  mini<,<„_idj  and  d„  <  d„_i. 


8.3  Cubic  Convergence 

The  assertion  in  Section  8.2  that  the  shift  yields  quadratic  convergence  for  qds 

and  dqds  appears  to  contradict  the  result  of  Rutishauser  [23]  that  this  choice  yields  cubic 
convergence.  See  also  Rutishauser  and  Schwarz  [26]  and  Chapter  8  of  Wilkinson  [29]. 
Actually,  there  is  no  anomaly  because  the  shift  strategies  are  not  quite  the  same.  In  our 
terminology  what  Rutishauser  suggests  is  that  the  qd  transform  with  should  not 

be  formed  explicitly.  The  only  item  wanted  from  it  is  q„  and  it  is  assumed  to  be  the  only 
negative  g<.  Then  it  is  shown  that  +  ?„  is  a  fourth  order  approximation  to  <t^  from 
below.  A  qd  transform  of  {o',  e}  with  shift  +  q„  will  yield  cubic  convergence. 

The  point  that  is  stressed  by  neither  Rutishauser  nor  Wilkinson  is  that  the  computation 
of  q„  costs  0{n)  operations,  very  close  to  1  step  of  qds.  From  another  perspective 
Rutishauser’s  analysis  is  a  disguised  derivation  of  the  cubic  convergence  of  the  tridiagonal 
QR  algorithm  with  the  Rayleigh  quotient  shift. 
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For  our  algorithm  dqds  Rutishauser’s  (late  failure)  shift  strategy  described  above  is  more 
appealing.  Only  the  first  phrase  of  our  algorithm  is  needed  to  compute  and  the  cost 
is  about  I  of  a  dqds  step.  Moreover  the  positive  form  is  preserved  a  little  longer. 

For  the  sake  of  completeness  we  indicate  why  is  a  fourth  order  lower  bound  when 

g„  is  second  order  in  g„e„_i.  The  relevant  tridiagonal  matrix  is  BB*  and  its  leading 
principal  (n  —  1)  X  (n  —  1)  submatrix  is  called  V.  Recall  that  Uj  is  column  j  of  /. 

Fact  1:  Provided  {V  -  a^I)  is  invertible  may  be  written  as 

Fact  2:  With  shift  =  qn, 

9n  =  ~  Qn!)  ^n-1 


Conclusion, 

+  ?n  =  [{V  -  -{V-  q„I)-^]  «„_i. 

By  Hilbert’s  second  resolvent  law, 

?n  +  ?n  =  +  {qne„.incrl  -  -  crlir\V  - 

Using  Fact  1  again, 

q.+qn  =  (Tl-  -  q„I)-\V  -  . 

The  gap  conditions  ensure  that  the  quantity  in  {  }  is  0(l/gap). 

In  contrast  to  Rutishauser  and  Wilkinson,  we  have  made  no  approximations  in  our  deriva¬ 
tion.  The  result  is  valid  so  long  as  the  inverse  matrices  exist. 


9  A  Preliminary  Implementation 

9.1  Choice  of  Shifts 

The  standard  singular  value  codes  in  LINPACK  and  LAPACK  need  about  2  QR  steps 
per  singular  value,  in  most  cases,  and  that  provides  a  hard  target  to  beat.  Moreover  one 
of  our  qd  transformations  needs  only  0(n)  fiops  and  no  square  roots  so  we  are  reluctant 
to  spend  0(n)  fiops  on  shift  selection. 

A  strategy  used  to  generate  the  numerical  tests  in  this  paper  may  not  be  the  best  but  it 
is  based  on  the  following  somewhat  surprising  observation.  The  upper  bound 

dk  =  min  di 
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is  an  increasingly  good  estimate  for  Moreover,  at  each  step,  we  will  know  the 

index  k  from  the  previous  step.  This  index  points  to  the  largest  diagonal  entry  of 
and  helps  teU  us  whether  has  yet  migrated  to  the  bottom  of  B*B.  If  >  n  —  1 
we  expect  the  trailing  2x2  principal  submatrix  of  BB*  to  give  a  good  approximation 
to  When  A:  <  n  —  1  the  matrix  is  not  yet  in  asymptotic  form  and  the  situation  is 
more  difficult.  However  we  do  know  that  y/q^  +  ejt  is  the  smallest  (leftmost)  centre  of  all 
the  Johnson  discs  for  the  matrix  of  equation  (15).  If  this  disc  were  separated  from  the 
rest  of  the  discs  then  it  would  certainly  contain  Cmin-  Even  if  it  were  not  isolated  this 
disc  might  stiU  contain  (Tmin  so  we  use  it. 


Strictly  speaking  we  need 


(flt  —  -^{^k  +  ^Ib-l))^  > 

> 


(oib  -  -s/e)^ 

9it  -  2v^  +  e 


(39) 


where  e  =  max{efc,et_i}.  When  this  disc  is  isolated  then  the  lower  bound  is  definitely 
too  small.  A  more  cautious  formula  is 

f 

mm  +  e,  0.96  *  sup} 


Since  our  estimates  may  exceed  we  must  face  the  possibility  of  rejection  of  a  shift 
r^.  How  should  the  new  shift  be  chosen  ?  On  one  hand  it  is  not  efficient  to  panic  and 
simply  bisect,  <—  \{inf  +  r^);  on  the  other  hand  we  must  avoid  being  trapped  close  to 
a  large  overestimate  of  Our  compromise  is  to  multiply  the  previous  gap  between 

sup  and  by  4  to  yield 


—  min  {4  +  (sup  —  r^), 


9.2  Splitting  and  Deflation 


In  Section  2  it  was  noted  that  if  e,-  =  0  (i.e.  6,-  =  0),  for  i  <  n,  then  the  bidiagonal  B 
splits  into  two  complementary  subdiagonals.  Consider  now  the  case  when  e,-  (or  6,)  is 
small  enough  to  permit  such  a  splitting  without  making  a  relative  change  in  any  singular 
value  exceeding  a  given  tolerance  q.  Our  situation  is  a  little  more  complicated  than  the 
one  studied  in  DK  because  of  the  non-restoring  shift.  Let  denote  the  cumulative  sum 
of  all  shifts  used  on  the  given  matrix  in  the  qd  algorithm  (which  computes  the  squared 
singular  values). 


Our  criteria  are  based  on  Weyl’s  monotonicity  theorem  for  eigenvalues  of  symmetric 
matrices.  Consider  a  bidiagonal  B  and  the  possibility  of  neglecting  6j  for  a  given  i  <  n. 
Write 
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where  A,-,j  is  a  null  matrix  except  that  the  {i,j)  is  unity.  There  are  two  cases.  Case  I: 


Rt  R  .  _  /  SlBi  +  cr^I  a<6,A,,i 

^  ^  ^  a.6.Ai,  +  6?Ai,i 


Case  II: 


BB^  +  0-2/ 


BiB[  +  6?Aj,i  +  (T^I  a,+i6,Ai.i 

a<+i6jAi,,-  B2B2  +  cr^I 


The  spectral  norm  of  the  matrix 

is  |a|  and  it  is  submatrices  of  this  form  that  we  will  remove.  Weyl’s  theorem  states  that 
on  subtracting  a  principal  submatrix  of  the  above  form,  no  eigenvalue  is  changed  by  more 
than  |a|.  Apply  this  result  to  Case  I  and  conclude  that  if 

(40) 

then  there  are  i  indices  j  such  that 

Xj[B*B  +  a^I]  =  \j[B\Br  +  +  Cjf 

with  Cj  <  77.  Here  X[M]  is  an  eigenvalue  of  M.  Recall  that  B  is  not  the  original  matrix 
whose  singular  values  are  to  be  found.  However  the  eigenvalues  of  B*B  +  o^I  are  the 
squares  of  the  wanted  numbers. 

Applying  Weyl’s  theorem  to  Case  II  shows  that  if 

ai^ihi<2ri{a^  +  al,i„[B2])  (41) 

then  there  are  n  —  i  +  1  indices  j  such  that 

Xj[B^B  +  a^I]  =  Xi[B2B\  + 

with  6;  <  rf. 

Since  crmin[Bi]  and  crmin[B2]  are  not  usually  known  (unless  i  =  1,2,  or  n  —  l,n)  the 
criteria  given  above  must  be  replaced  by  a  more  exacting  one 

6,-  <  +  inf)/  max{a<,a,+i}  (42) 

where  inf  is  the  best  lower  bound  on 


The  conditions  (41)  and  (42)  are  more  severe  than  DK’s 


bi  <  Wmin[B] 


(43) 


when  <  inf  since  maxai,ai+i  >  y/inf.  However  after  the  tiny  eigenvalues  are  found 
it  is  not  fair  to  compare  6,*  with  the  current  B.  In  principal  (42)  and  (43)  can  be  merged 


e.  <  rf  max  {in/.  4(<r=  +  in/)  (_^!+{^)  | 


When  i  =  71—1  or  n  —  2  criterion  (40)  tells  us  when  to  deflate  the  bottom  1  x  1  or  2  x  2 
submatrix  from  B  and  be  sure  that  the  singular  values  of  Bi  are  adequate  approximations 
to  the  remaining  singular  values  of  B. 
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9.3  Performance  of  a  Prototype  Implementation 

We  have  developed  and  implemented  dqds  in  FORTRAN  77  to  exploit  and  study  the 
theory  we  have  developed  in  this  paper.  This  prototype  program  is  built  in  modular 
fashion. 

We  have  run  our  code  on  a  broad  test  bed  of  bidiagonals.  Here  we  report  on  comparisons 
on  three  interesting  classes  using  our  dqds  and  LINPACK’s  dsvdc  (with  reduction  to 
bidiagonals  removed). 

Example  4  (nice  matrices)  We  considered  the  graded  matrix  defined  earlier  in  Ex¬ 
ample  3  with  the  parameter  /3  =  2  and  n  =  30.  Table  7  gives  the  performance  of  this 
example  and  other  examples  in  this  section.  The  speedup  is  the  time  taken  by  dsvdc 
divided  by  dqds. 

We  have  also  tested  this  problem  with  n  =  40  and  in  that  case  the  LINPACK  dsvdc 
returned  with  an  error  flag  as  it  could  not  compute  <740  within  30  iterations.  We  were 
prevented  from  comparing  with  larger  values  of  n  because  dsvdc  reported  errors.  • 

Example  5  (perversely  graded)  To  make  conditions  artificially  difllcult  for  dqds,  we  also 
ran  the  programs  with  the  reversely  graded  matrix  as  the  input  with  n  =  30  and 
/?  =  2.  See  Table  7  for  details. 

Usually,  our  dqds  will  flip  5.  to  obtain  B^.  H  the  user  does  not  flip  before  calling 
dsvdc  then  the  time  ratio  goes  up  to  18.9. 

dsvdc  also  failed  to  converge  for  many  combinations  of  /?  and  n.  • 

Example  6  Let  B^  be  the  Wilkinson- type  bidiagonal  matrix  where 

Tt 

+  1|  ,  i=:l,...,n 

bi  =  1  ,  i  =  l,..,,n  —  1 

This  matrix  has  close  eigenvalues  (twins)  and  our  current  coding  does  not  fully  exploit 
this  structure.  Hence  the  low  performance  compared  with  the  previous  results.  • 

Example  7  Doubled  Wilkinson- type  matrices,  B2w^ 

Uj  —  +  1|  j  t  —  1, . . . ,  ~ 

n  . 

tt,.}.  »  —  (li  ,  t  —  — 

6,  =  1  ,  i  =  l,...,n~  1 

with  n  =  41.  This  matrix  has  close  singular  values  (quads)  and  some  of  them  are  exactly 
equal.  Table  7  gives  the  details.  • 
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Example  8  Toeplitz  matrix  Bt, 


a<  =  1  ,  6,-  =  2. 

For  n  =  100,  the  matrix  has  a  tiny  singular  value;  others  are  between  1  and  3.  • 


Example 

Matrix 

n 

dqd  sweeps 

dsvdc  sweeps 

speedup 

4 

5+ 

30 

52 

60 

10.2 

5 

B. 

30 

79 

101 

12.3 

6 

B^ 

21 

78 

62 

4.8 

7 

B2W 

41 

230 

120 

4.8 

8 

Bt 

100 

374 

308 

11.0 

Table  7:  Performance  comparison 


10  The  Demmel/Kahan  Paper 


We  summarize  the  highlights  of  this  impressive  contribution  [4], 


10.1  High  Relative  Accuracy 

Corollary  2  of  Theorem  2  of  DK.  Suppose  that  (5+^5),* 
a,-  0.  Define 

2n-l 

a  :=  n  max{la,|,|ari|}. 

»  =  1 

Let  o-j  >  cTj  >  . . .  >  be  the  singular  values  of  B  +  SB.  Then 

(Ti/a  <  <  aide  ,  i  =  l,2, ...,71. 

This  shows  that  bidiagonal  matrices  determine  their  singular  values  to  high  relative 
accuracy. 


10.2  Bounds  for  <7„ 

It  is  possible  to  compute  ||J5'  -^loo  and  using  2(n— 1)  divisions  and  multiplications. 

The  algorithm  is 

Aj- :=  Oj(A_,+i/(Aj+i  +  6j))  ,  j  =  n  -  l,n- 2,...,1  with  A„  :=  a„ 

^lj+l  :=  ,  j  =  1  with  :=  Oi 

||5“^||oo  =  1/ininAj- 
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||5  ^lli  =  1/mm/i;-. 

Finally, 

n-i/2jQax{||B-^||-M|5-^llr^}  <  <t„  <  n^/2min{|15-^||-M|B-^llr^} 

and 


10.3  A  Stopping  Criterion 

Let  77  <<  1  be  the  desired  relative  accuracy  of  the  computed  singular  values.  Then  if 
either 

h/^J+l  <V  OT  bj/flj  <  T] 

set  bj  to  zero  and  the  two  pieces  into  which  B  splits  may  be  processed  separately.  The 
criteria  used  in  LINPACK  [5]  can  sometimes  deliver  a  zero  singular  value  when  it  should 
not  and  can  sometimes  fail  to  suppress  a  negligible  off  diagonal  entry  bj. 


10.4  Bidiagonal  QR  with  Zero  Shift 

The  standard  Golub/Reinsch  algorithm  [11],  [10]  used  in  LINPACK  may  be  simplified 
when  no  shifts  are  used.  Of  more  importance  is  the  fact  that  in  this  case  all  round  off 
errors  arise  multiplicatively.  Moreover  for  the  calculation  of  tiny  singular  values  zero  is 
a  good  shift  and  it  pays  to  compute  them  first  rather  then  letting  the  standard  shift 
strategy  dictate  the  order  in  which  the  singular  values  are  found.  The  arithmetic  effort 
in  the  innermost  loop  is 

Golub/Reinsch:  2  calls  to  ROT  +  12  multiplications  +  4  additions 
Demmel/Kahan:  2  calls  to  ROT  +  4  mtiltiplications. 

The  procedure  ROT  computes  the  sine  and  cosine  needed  for  a  plane  rotation  using  2 
divisions,  3  multiplications,  and  1  square  root.  Here  is  the  algorithm. 


oldcs  :=  cs  :=  1 
for  i  :=  1,» -  1 
call  ROT{ai  *  cs,  bi,  cs,  sn,  r) 
if  (i  ^  1)  6,_i  :=  oldsn  *  r 
call  ROT{oldcs  *  r,  *  sn,  oldcs,  oldsn,  bi) 
h  :=  a„  +  cs;  6„_i  :=  h*  oldsn;  a„  :=  h*  oldcs 
end  for 


(45) 
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In  the  absence  of  underflow  the  error  bound  on  singular  values  after  one  zero  shift  bidi- 
agonal  QR  transform  is 


w 


\  —  w 


cr,- 


where 


w  :=  <  1. 


See  Theorem  6,  p.  906. 


10.5  The  Overall  Algorithm 


if  (roundoff  in  any  shift  exceeds  tol  *  bound  on  «7„)  then 
use  zero-shift  QR  or  QL 
else 

use  shifted  QR  or  QL 
end  if. 

10.6  Other  Improvements 

The  new  code  uses  either  QL  or  QR  as  appropriate  according  to  the  way  B  is  graded. 

An  efficient  accurate  subroutine  is  provided  to  return  the  singular  values  and  vectors  of 
2x2  bidiagonal  matrices. 

Deflation  when  a  diagonal  entry  o,-  vanishes  is  automatic  and  occurs  at  the  bottom  or 
top  of  B. 


11  Evolution  of  qd 


Some  of  the  available  presentations  of  the  qd-algorithm,  see  [22],  [27],  [12]  show  its 
close  connection  with  factorization  of  tridiagonal  matrices  but  some  do  not  [13],  [28]. 
Nevertheless  its  discovery  had  nothing  to  do  with  matrix  decompositions  and  a  knowledge 
of  the  origins  helps  us  to  understand  the  somewhat  neglected  status  of  the  algorithm. 
In  the  next  few  paragraphs  we  sketch  an  earlier  paper  [17]  which  described  the  gradual 
evolution  of  the  qd-algorithm. 

The  story  begins  with  Daniel  Bernoulli  in  1728  when  he  showed  that  the  largest  and  the 
smallest  roots  of  an  nth  degree  polynomial  can  be  obtained  by  iterating  an  rath  degree 
difference  equation.  See  [2].  The  work  of  Bernoulli  was  extended  by  Euler  in  1748.  See 
Chapter  17  of  [7]  (English  translation  [6]). 
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We  axe  given  a  rational  function  of  a  complex  variable  z, 

/(^)  =  ) 
k=0 

assumed  to  be  regular  (analytic)  at  both  2  =  0  and  2  =  oo.  The  Taylor  series  converges  to 
f{z)  only  within  a  circle  (in  C)  centred  at  2  =  0  and  extending  up  to  the  nearest  pole  pi. 
However,  by  analytic  continuation,  the  Taylor  coefficients  {hk}  actually  define  a  unique 
rational  function  /  on  all  of  C  except  the  poles  pi,P2,P3,  •  •  ••  The  problem  is  to  determine 
the  poles  directly  from  the  {hk}  without  having  recourse  to  analytic  continuation. 

In  1884  Konig  [16]  showed  that  if  pi  is  a  simple  pole  and  smaller  than  aU  the  others  then 

lim  (hi/ht+i)  =  pi. 

♦'00 


Exactly  one  hundred  years  ago  (i.e.  in  1892),  in  his  dissertation,  J.  Hadamard  showed 
that 


where 


hje 

hk^i 

•  •  *  1 

=  det 

hk+1 

hk-^2 

•  • •  hk^m 

hk-^m 

•  •  •  ^ib+2m-2 

The  are  called  Hankel  determinants. 


It  follows  that 


p,r,  =  lim 

Jk-*oo 


\  J  ’ 


The  solution  is  brilliant  but  does  not  give  us  a  practical  algorithm. 

During  the  1920s,  in  Scotland,  A.  C.  Aitken  rediscovered  for  himself  a  remarkable  con¬ 
nection  among  Hankel  matrices  that  was  known  to  Hadamard  but  not  exploited  by  him; 

(H*  )^  +  (46) 


The  relation  (46)  permits  the  computation  of  all  the  without  being  drowned  in 
determinantal  equations. 

The  blemish  in  (46)  is  that  the  are  not  of  direct  interest.  We  want  to  compute 
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^1 

e(2) 

• 

.(3) 

62 

Figure  3:  qd  in  a  (modified)  difference  table 
Rutishauser’s  clever  observation  was  that  if  one  introduces 

•  niH!L+^ 


then  (46)  implies 

the  additive  rhombus  rule,  while  the  definitions  of  q  and  e  give  the  product  rhombus  rule 


Hm  I  Hm  '  ^m-1  ? 


The  rhombus  rules  were  introduced  at  the  end  of  Section  4.  The  qs  and  es  are  best  laid 
out  in  a  tableau  that  is  like  a  difference  table.  See  Figure  3. 

This  qd  table  may  be  built  up  via  the  rhombus  rules  either  from  column  1  or  from  the 
top  diagonal.  The  first  column,  is  at  hand,  since 

=  hu/K-,  ,  k>  I 


and 


.(*)  _  J*+i)  _  Sh) 
€i  —  “  9i  • 


This  is  far  simple  than  Hadamard’s  solution  but,  in  finite  precision  arithmetic,  it  is 
hopelessly  unstable  because  the  later  e’s  are  (modified)  differences  of  converging  values. 

Fortunately  computation  along  descending  diagonals  is  stable  but  here  the  difficulty  is 
the  calculation  of  the  tcp  diagonal.  This  is  not  as  daunting  as  it  appears  at  first.  If  the 
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function  /(C)  has  only  n  poles  then  all  q  (and  e)  columns  beyond  the  nth  vanish.  Then 
it  suffices  to  build  the  n  x  n  Hankel  matrix  H°  and  compute  its  triangular  factorization 

Hi  =  L„D„Ll 

where  D„  =  diag(di, .  ..,dn)  holds  the  successive  pivots.  It  turns  out  that 

qf^  =  HllHl_,  =  dk  ,  k  =  l,...,n 

The  are  found  from  the  pivots  of  This  is  the  practical  way  to  compute  the 
poles  from  the  Taylor  coefficients.  In  fact  a  careful  form  of  row  interchanges  (not  partial 
pivoting)  may  be  used  to  improve  the  accuracy  of  the  factorization. 

Next  we  relate  the  qd  tableau  to  the  computation  of  eigenvalues.  Given  a  square  matrix 
C  the  appropriate  rational  function  /  comes  from  the  resolvent, 

f(z)  =  x\I-zC)-^y, 

where  x  and  y  are  column  vectors.  A  technical  assumption  is  needed  to  guarantee  that 
the  qd  tableau  is  well  defined.  In  the  language  of  control  theory,  see  [15],  the  linear 
dynamical  system  S{C,y,x*)  must  be  minimal.  If  it  is  not  minimal,  then  we  might  not 
be  able  to  find  all  the  poles  of  the  system. 

For  this  function  /, 

h).  =  x*C'‘y/x*C'’~^y 

as  so  the  {fit}  could  be  computed  by  the  power  method.  However,  it  would  be  preferable 
to  compute  ^  directly  from  C  and  we  now  know  that  this  can  be 

done  by  invoking  the  Lanczos  algorithm  on  C  and  using  the  resulting  tridiagonal  matrix 
J.  It  turns  out  that  the  pivots  that  occur  in  computing  the  triangular  factorization  of  J 
are  the  and  their  reciprocals  are  the  The  details  are  given  in  [17].  In  other 

words. 


1 

* 

1 

■ 

®1 

1 

1 

C2 

1 

4“^ 

1 

1 

(0) 

94  ^ 

.  . 

• 

We  see  here  how  the  LR  algorithm  on  tridiagonals  was  hidden  in  the  qd  table. 

12  The  Continued  Fraction  Connection 


There  is  an  intimate  connection  between  our  bidiagonal  matrix  B,  the  tridiagonal  ma¬ 
trix  T  =  B^B,  and  a  continued  fraction  associated  with  them.  Properties  of  continued 
fractions  influenced  the  qd  algorithm  initially  and  only  later  did  the  LR  transformation 
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emerge  and  nearly  displace  the  continued  fraction.  We  can  not  find  any  discussion  by 
Rutishauser  of  the  connection  between  the  continued  fraction  and  B^B  so  we  supply  it 
here. 

Recall  the  notation  from  Sections  2,  3,  and  4. 


T  =  tridiag  < 


=  bidiag  < 

bi  62 

•  ^n-2 

«  }• 

1 

dl  Q>2 

•  •  ®n— 1 

On  J 

9. 

=  «.? 

c?  =  6? 

0 

II 

e 

II 

0 

■v/9l^ 

^92  62 

V9363 

\/9n-l6n-l 

9i 

92  +  61 

93  +  62 

9n  +  6„_i 

V9363 

■\/9n-l6n-l 

Rutishauser  associates  with  T  the  continued  fraction 

1  Ciqi  6292 


C-9i-  C-92-ei-  C -92-62- 


(47) 


It  is  not  obvious  how  F{C)  relates  to  T.  The  answer  is 


F(c)  =  [(a-r)-^]i,i 


or  more  generally, 


F{0  =  x\ci-T)-^y 


with  X  and  y  as  defined  near  the  end  of  Section  11.  The  inverse  is  well  defined  for  aU  C 
with  Id  exceeding  the  spectral  radius  of  T.  The  particular  form  of  the  continued  fraction 
arises  from  the  triangular  factorization  of  Cl  —  T  from  the  bottom  up: 


CI-T=  Vbi 


where  L  is  unit  triangular  and 


Then 

and 


The  recurrence  for  the  dj  is 


]D  diO’Q^d'^^  d^y . . . ,  d„) 

di  =  d.(C). 

iCi-T)-^  =  i-^b-^L-* 

nc)  =  dV 


dn  ■=  C  -  9'n  -  fin-l 

dj  :=  C-9j-  e;-i  -  qjejfdj+i  for  j  =  n  -  1, . . . ,  2, 1. 


(48) 
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This  establishes  (47). 


There  is  a  simpler  continued  fraction  expansion  for  F{C).  It  corresponds  to  a  recurrence 
for  dj  +  €j^i.  From  (48) 


dj  4-  €j—i 


C  -  9j(i  + 


) 


C-9i/ 


\  dj+i  +  ej  j 


C  9j/ (l  ®i/(4;+i  "I"  ®i)) 


(49) 


Since  Cq  =  0,  (49)  gives 


m = iM  =  ^  ^ 


iJL  iL 

C-  1- 


^2 


This  form  is  remarkable  for  the  direct  connection  of  qi  and  to  the  (1,1)  entry  of 
(CI-T)-K 
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